Sunday, November 16, 2014

Unix Prog: Socket Addressing(2)

1. Address Lookup
The network configuration can be kept in a number of places, like static files(/etc/hosts, /etc/services etc.), or can be managed by a name service, such as DNS(Domain Name System) or NIS(Network Information Service).

1) host
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Description of data base entry for a single host. */  
 struct hostent  
 {  
  char *h_name;         /* Official name of host. */  
  char **h_aliases;       /* Alias list. */  
  int h_addrtype;        /* Host address type. */  
  int h_length;         /* Length of address. */  
  char **h_addr_list;      /* List of addresses from name server. */  
 #if defined __USE_MISC || defined __USE_GNU  
 # define    h_addr h_addr_list[0] /* Address, for backward compatibility.*/  
 #endif  
 };  
   
 /* Open host data base files and mark them as staying open even after  
   a later search if STAY_OPEN is non-zero.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void sethostent (int __stay_open);  
   
 /* Close host data base files and clear `stay open' flag.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void endhostent (void);  
   
 /* Get next entry from host data base file. Open data base if  
   necessary.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct hostent *gethostent (void);  
   
 /* Return entry from host data base which address match ADDR with  
   length LEN and type TYPE.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct hostent *gethostbyaddr (const void *__addr, __socklen_t __len,  
                    int __type);  
   
 /* Return entry from host data base for host with NAME.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct hostent *gethostbyname (const char *__name);  
 ......  

gethostent: return the next entry in the host database file.
sethostent: open the file or rewind it if it is already open
endhostent: close the file

2) networks

 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Open network data base files and mark them as staying open even  
   after a later search if STAY_OPEN is non-zero.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void setnetent (int __stay_open);  
   
 /* Close network data base files and clear `stay open' flag.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void endnetent (void);  
   
 /* Get next entry from network data base file. Open data base if  
   necessary.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct netent *getnetent (void);  
   
 /* Return entry from network data base which address match NET and  
   type TYPE.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct netent *getnetbyaddr (uint32_t __net, int __type);  
   
 /* Return entry from network data base for network with NAME.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct netent *getnetbyname (const char *__name);  
 ......  
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/netdb.h  
 ......  
 /* Description of data base entry for a single network. NOTE: here a  
   poor assumption is made. The network number is expected to fit  
   into an unsigned long int variable. */  
 struct netent  
 {  
  char *n_name;         /* Official name of network. */  
  char **n_aliases;       /* Alias list. */  
  int n_addrtype;        /* Net address type. */  
  uint32_t n_net;        /* Network number. */  
 };  
 ......  

3) Protocols
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Description of data base entry for a single service. */  
 struct protoent  
 {  
  char *p_name;         /* Official protocol name. */  
  char **p_aliases;       /* Alias list. */  
  int p_proto;         /* Protocol number. */  
 };  
   
 /* Open protocol data base files and mark them as staying open even  
   after a later search if STAY_OPEN is non-zero.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void setprotoent (int __stay_open);  
   
 /* Close protocol data base files and clear `stay open' flag.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void endprotoent (void);  
   
 /* Get next entry from protocol data base file. Open data base if  
   necessary.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct protoent *getprotoent (void);  
   
 /* Return entry from protocol data base for network with NAME.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct protoent *getprotobyname (const char *__name);  
   
 /* Return entry from protocol data base which number is PROTO.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct protoent *getprotobynumber (int __proto);  
 ......  

4) services:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Description of data base entry for a single service. */  
 struct servent  
 {  
  char *s_name;         /* Official service name. */  
  char **s_aliases;       /* Alias list. */  
  int s_port;          /* Port number. */  
  char *s_proto;        /* Protocol to use. */  
 };  
   
 /* Open service data base files and mark them as staying open even  
   after a later search if STAY_OPEN is non-zero.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void setservent (int __stay_open);  
   
 /* Close service data base files and clear `stay open' flag.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern void endservent (void);  
   
 /* Get next entry from service data base file. Open data base if  
   necessary.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct servent *getservent (void);  
   
 /* Return entry from network data base for network with NAME and  
   protocol PROTO.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct servent *getservbyname (const char *__name, const char *__proto);  
   
 /* Return entry from service data base which matches port PORT and  
   protocol PROTO.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern struct servent *getservbyport (int __port, const char *__proto);  
 ......  

5) Address Information
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Translate name of a service location and/or a service name to set of  
   socket addresses.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern int getaddrinfo (const char *__restrict __name,  
             const char *__restrict __service,  
             const struct addrinfo *__restrict __req,  
             struct addrinfo **__restrict __pai);  
   
 /* Free `addrinfo' structure AI including associated storage. */  
 extern void freeaddrinfo (struct addrinfo *__ai) __THROW;  
 ......
 /* Structure to contain information about address of a service provider.  */
 struct addrinfo
 {
   int ai_flags;                 /* Input flags.  */
   int ai_family;                /* Protocol family for socket.  */
   int ai_socktype;              /* Socket type.  */
   int ai_protocol;              /* Protocol for socket.  */
   socklen_t ai_addrlen;         /* Length of socket address.  */
   struct sockaddr *ai_addr;     /* Socket address for socket.  */
   char *ai_canonname;           /* Canonical name for service location.  */
   struct addrinfo *ai_next;     /* Pointer to next in list.  */
 };
 ......  

getaddrinfo: providing the host name, service name or both, to return one linked list of addrinfo strucutures.
freeaddrinfo: release each addrinfo structure in the linked list.

If getaddrinfo fails, we need to call gai_strerror to convert the error code returned into an error message.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Convert error return from getaddrinfo() to a string. */  
 extern const char *gai_strerror (int __ecode) __THROW;  
 ......  

6) Converts an address into a host name and a service name
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 /* Translate a socket address to a location and service name.  
   
   This function is a possible cancellation point and therefore not  
   marked with __THROW. */  
 extern int getnameinfo (const struct sockaddr *__restrict __sa,  
             socklen_t __salen, char *__restrict __host,  
             socklen_t __hostlen, char *__restrict __serv,  
             socklen_t __servlen, int __flags);  
 ......  

getnameinfo converts an unix networking address into a host name and a service name. flags give us some control about how the translation is done:

 ubuntu@ip-172-31-23-227:~$ less /usr/include/netdb.h  
 ......  
 # define NI_NUMERICHOST 1    /* Don't try to look up hostname. */  
 # define NI_NUMERICSERV 2    /* Don't convert port number to name. */  
 # define NI_NOFQDN   4    /* Only return nodename portion. */  
 # define NI_NAMEREQD  8    /* Don't return numeric addresses. */  
 # define NI_DGRAM    16   /* Look up UDP service rather than TCP. */  
 ......  

Unix Prog: Socket Addressing(1)

1. Basic Concept

The machine's network address helps us identify the computer on the network. And the service helps us identify the particular process on that computer

2. Byte Ordering
The byte order is a characteristic of the processor architecture, dictating how bytes are ordered within larger data types, such as integers.

big-endian: for value 0x04030201, 4 occupies lowest memory address, 1 occupies highest memory address

little-endian: for value 0x04030201, 4 occupies highest memory address, 1 occupies lowest memory address

Network protocols specify a byte ordering so that heterogeneous computer systems can exchange protocol information without confusing the byte ordering. So applications need to translate processor's byte order to the network byte order or in reverse.

System Calls:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netinet/in.h  
 ......  
 /* Functions to convert between host and network byte order.  
   
   Please note that these functions normally take `unsigned long int' or  
   `unsigned short int' values as arguments and also return them. But  
   this was a short-sighted decision since on different systems the types  
   may have different representations but the values are always the same. */  
   
 extern uint32_t ntohl (uint32_t __netlong) __THROW __attribute__ ((__const__));  
 extern uint16_t ntohs (uint16_t __netshort)  
    __THROW __attribute__ ((__const__));  
 extern uint32_t htonl (uint32_t __hostlong)  
    __THROW __attribute__ ((__const__));  
 extern uint16_t htons (uint16_t __hostshort)  
    __THROW __attribute__ ((__const__));  
 ......  

n represents network, h represents host, l represents long, s represents short.
ntohl represents: Convert network byte order to host byte order, unsigned long
ntohs represents: Convert network byte order to host byte order, unsigned short
htonl represents: Convert host byte order to network byte order, unsigned long
htons represents: Convert host byte order to network byte order, unsigned short

3. Address Formats:
1) Generic sockaddr structure
An address identifies a socket endpoint in a particular communication domain. The  address format is specific to the particular domain.

In order to let different socket address structure to be passed to general socket functions, unix has one generic sockaddr address structure, which can be generated by casting domain specific address structure.

 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/socket.h  
 ......  
 /* Structure describing a generic socket address. */  
 struct sockaddr  
  {  
   __SOCKADDR_COMMON (sa_);  /* Common data: address family and length. */  
   char sa_data[14];      /* Address data. */  
  };  
 ......  

2) IPV4 Address Structure
IPV4 Address can be converted to generic sockaddr structure
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netinet/in.h  
 ......  
 /* Internet address. */  
 typedef uint32_t in_addr_t;  
 struct in_addr  
  {  
   in_addr_t s_addr;  
  };  
 ......  
 /* Structure describing an Internet socket address. */  
 struct sockaddr_in  
  {  
   __SOCKADDR_COMMON (sin_);  
   in_port_t sin_port;         /* Port number. */  
   struct in_addr sin_addr;      /* Internet address. */  
   
   /* Pad to size of `struct sockaddr'. */  
   unsigned char sin_zero[sizeof (struct sockaddr) -  
               __SOCKADDR_COMMON_SIZE -  
               sizeof (in_port_t) -  
               sizeof (struct in_addr)];  
  };  
 ......  

3) IPV6 Address Structure
IPV6 address can also be converted to generic sockaddr structure
 ubuntu@ip-172-31-23-227:~$ less /usr/include/netinet/in.h  
 ......  
 /* IPv6 address */  
 struct in6_addr  
  {  
   union  
    {  
     uint8_t __u6_addr8[16];  
 #if defined __USE_MISC || defined __USE_GNU  
     uint16_t __u6_addr16[8];  
     uint32_t __u6_addr32[4];  
 #endif  
    } __in6_u;  
 #define s6_addr         __in6_u.__u6_addr8  
 #if defined __USE_MISC || defined __USE_GNU  
 # define s6_addr16       __in6_u.__u6_addr16  
 # define s6_addr32       __in6_u.__u6_addr32  
 #endif  
  };  
 ......  
 struct sockaddr_in6  
  {  
   __SOCKADDR_COMMON (sin6_);  
   in_port_t sin6_port;    /* Transport layer port # */  
   uint32_t sin6_flowinfo;   /* IPv6 flow information */  
   struct in6_addr sin6_addr; /* IPv6 address */  
   uint32_t sin6_scope_id;   /* IPv6 scope-id */  
  };  
 ......  

4) Conversion between the binary address format and a string dotted-decimal notation(for example, 127.0.0.1)

 ubuntu@ip-172-31-23-227:~$ less /usr/include/arpa/inet.h  
 ......  
 /* Convert from presentation format of an Internet number in buffer  
   starting at CP to the binary network format and store result for  
   interface type AF in buffer starting at BUF. */  
 extern int inet_pton (int __af, const char *__restrict __cp,  
            void *__restrict __buf) __THROW;  
   
 /* Convert a Internet address in binary network format for interface  
   type AF in buffer starting at CP to presentation form and place  
   result in buffer of length LEN astarting at BUF. */  
 extern const char *inet_ntop (int __af, const void *__restrict __cp,  
                char *__restrict __buf, socklen_t __len)  
    __THROW;  
 ......  

the first argument, af, represents the domain type, only following 2 domains are supported:
AF_INET, AF_INET6 (Only 2 domains are supported: AF_INET, AF_INET6)

Unix Prog: Socket Descriptors

1. Unix Socket
A socket is an abstraction of a communication endpoint.
Socket descriptors are implemented as file descriptors in the unix system.
Many of the functions that deal with file descriptors, such as read and write, will work with socket descriptor.

2. Create a socket descriptor.
System Call:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/socket.h  
 ......  
 /* Create a new socket of type TYPE in domain DOMAIN, using  
   protocol PROTOCOL. If PROTOCOL is zero, one is chosen automatically.  
   Returns a file descriptor for the new socket, or -1 for errors. */  
 extern int socket (int __domain, int __type, int __protocol) __THROW;  
 ......  

domain argument determines the nature of the communication, including the address format.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/socket.h  
 ......  
 #define AF_UNSPEC    PF_UNSPEC  
 #define AF_LOCAL    PF_LOCAL  
 #define AF_UNIX     PF_UNIX  
 ......  
 #define AF_INET     PF_INET  
 ......  
 #define AF_INET6    PF_INET6  
 ......  

type argument determines the type of the socket, which further determines the communication characteristic.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/socket_type.h  
 ......  
 /* Types of sockets. */  
 enum __socket_type  
 {  
  SOCK_STREAM = 1,       /* Sequenced, reliable, connection-based  
                   byte streams. */  
 #define SOCK_STREAM SOCK_STREAM  
  SOCK_DGRAM = 2,        /* Connectionless, unreliable datagrams  
                   of fixed maximum length. */  
 #define SOCK_DGRAM SOCK_DGRAM  
  SOCK_RAW = 3,         /* Raw protocol interface. */  
 #define SOCK_RAW SOCK_RAW  
  SOCK_RDM = 4,         /* Reliably-delivered messages. */  
 #define SOCK_RDM SOCK_RDM  
  SOCK_SEQPACKET = 5,      /* Sequenced, reliable, connection-based,  
                   datagrams of fixed maximum length. */  
 #define SOCK_SEQPACKET SOCK_SEQPACKET  
  SOCK_DCCP = 6,        /* Datagram Congestion Control Protocol. */  
 #define SOCK_DCCP SOCK_DCCP  
  SOCK_PACKET = 10,       /* Linux specific way of getting packets  
                   at the dev level. For writing rarp and  
                   other similar things on the user level. */  
 #define SOCK_PACKET SOCK_PACKET  
   
  /* Flags to be ORed into the type parameter of socket and socketpair and  
    used for the flags parameter of paccept. */  
   
  SOCK_CLOEXEC = 02000000,   /* Atomically set close-on-exec flag for the  
                   new descriptor(s). */  
 #define SOCK_CLOEXEC SOCK_CLOEXEC  
  SOCK_NONBLOCK = 00004000   /* Atomically mark descriptor(s) as  
                   non-blocking. */  
 #define SOCK_NONBLOCK SOCK_NONBLOCK  
 };  
 ......  

protocol argument is usually zero, to select the default protocol for the given domain and socket type.
Default protocol for SOCK_STREAM socket in the AF_INET communication domain is TCP.
Defautl protocol for SOCK_DGRAM socket in the AF_INET communication domain is UDP.

Note: calling "socket" system call is similar from calling "open". You get a file descriptor that can be used for I/O. When we are done, we call "close" to relinquish the access of it.

Note: although socket descriptor is one file descriptor, but it doesn't mean we apply all system calls using file descriptors on socket descriptors, for example, lseek(socket doesn't have concept of offset).

3. Socket Type
1) SOCK_DGRAM Interface: no logical connection needs to exist between peers for them to communicate. A datagram provides a connection-less service.

A datagram is a self-contained message, which includes the counter-party address. Sending a datagram is analogous to mailing someone a letter. You can mail many letters, but you can't guarantee the order of delivery, and some might get lost along the way.

2) SOCK_STREAM Interface: requires that before exchanging data, you setup a logical connection between your socket and the socket belonging to the peer you want to communicate with.

Message contain no addressing information, as a point-to-point virtual connection exists between both ends, and the connection itself implies a particular source and destination.

With a SOCK_STREAM socket, applications are unaware of message boundaries. When we read data from a socket, it might not return the same number of bytes written by the process sending data. We will eventually get everything sent to us, but it might take several function calls.

3) SOCK_SEQPACKET Interface: it is same as SOCK_STREAM socket except that we get a message-based service instead of a byte-stream service. The amount of data received from a SOCK_SEQPACKET socket is the same amount as was written.

OS just "pack up" the entire message as in same packet and send to us.

4. Shut down socket
Communication on a socket is bidirectional. We can disable the I/O on a socket with the shutdown function.

 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/socket.h  
 ......  
 /* Shut down all or part of the connection open on socket FD.  
   HOW determines what to shut down:  
    SHUT_RD  = No more receptions;  
    SHUT_WR  = No more transmissions;  
    SHUT_RDWR = No more receptions or transmissions.  
   Returns 0 on success, -1 for errors. */  
 extern int shutdown (int __fd, int __how) __THROW;  
 ......  

close is similar from shutdown, but it deallocate the entire socket(only if the socket descriptor is not duplicated with dup, if that is the case, the socket will not be deallocated until the last socket descriptor is closed).

Unix Prog: Client Server Model

1. Model A:


In this model, parent process is the client process. Client process "fork" one child process, acting as the server, processing the request and pass back to client process(parent).

1) The server can be a set-user-ID program, giving it special privileges.
2) The server can determine the real identity of the client by looking at the real user id
3) With the client identity we can do additional permission checking on server side, to determine whether server should give the file back to client based on its user id
4) Problem: child can pass back the file content back to parent, this works for regular file, but not working for device files. Also, child process can not pass the file descriptor back to the parent.

2. Model B:

In this model, parent has one well-know FIFO, which receives requests from clients. And for each client, there is one specific FIFO to receive the reply from server.

3. Model C:
Same as above, just replace FIFOs with message queues.

In that case, client send their requests through well known message queue, also, each client create their own message queue with key "IPC_PRIVATE". Message sent to server must contain the client specific message queue id. In the future(server already know the message queue id of specific clients), all requests and replies can go though client specific message queues.

Problems with message queues:
1) each client specific message usually just have one message on it: a request from client or a response from the server. And OS has upper limit of how many IPC structures can exist system wide
2) If server need to read from multiple message queues, there is no I/O multiplexing like select/poll working with message queues. But if using FIFO(the file in file system), we can use select/poll to read from multiple FIFOs.

4. Model D:

Message Queues:
1) A single queue can be used between the server and all the clients, using the type field of each message to indicate the message recipient. If the type is 1, we can assume the message is for server, if the type is child process id, we can let child process to pick up their own messages.

For Model B, C, D:
When using the message queue, it is very difficult for server to identify the client(know the client's effective user id). Based on msg_lspid of queue, server can know the client process's process id, but there is no way for server to identify the effective user id of client to do the permission checking.

When using FIFOs, the server calls either stat or fstat on the client-specific FIFO, to get the owner id of the FIFO file. Here we assume the client's effective user id is the owner id of the FIFO file.

When using other IPC Structure(message queue, semaphore, shared memory), we can let client create their own IPC structure, then server use ipc_perm structure associated with each IPC structure to identify the creator(cuid and cgid fields)

Unix Prog: IPC Shared Memory(2)

1. Example of memory layout
shm.c:
 #include<stdio.h>  
 #include<stdlib.h>  
 #include<unistd.h>  
 #include<sys/shm.h>  
   
 #define ARRAY_SIZE 40000  
 #define MALLOC_SIZE 100000  
 #define SHM_SIZE 100000  
 #define SHM_MODE 0600 // user-read(S_IRUSR 0400), user-write(S_IWUSR 0200), BIT OR => 0600  
   
 char array[ARRAY_SIZE]; // Uninitialized data = bss  
   
 int main(int argc, char* argv[])  
 {  
  int shmid;  
  char *ptr, *shmptr;  
   
  // Print out address range of uninitialied array and local variable  
  printf("array[] from %lx to %lx\n", (unsigned long)&array[0], (unsigned long)&array[ARRAY_SIZE]);  
  printf("stack around %lx\n", (unsigned long)&shmid);  
   
  // Malloc the memory from heap, and print out the address range  
  if((ptr = malloc(MALLOC_SIZE)) == NULL) {  
   printf("malloc error");  
   exit(1);  
  }  
  printf("malloced from %lx to %lx\n", (unsigned long)ptr, (unsigned long)ptr + MALLOC_SIZE);  
   
  // Create the shared memory segment  
  if((shmid = shmget(IPC_PRIVATE, SHM_SIZE, SHM_MODE)) < 0) {  
   printf("shmget error!\n");  
   exit(2);  
  }  
   
  // Attach the local memory to shared memory  
  if((shmptr = shmat(shmid, 0, 0)) == (void*)-1) {  
   printf("shmat error!\n");  
   exit(3);  
  }  
  printf("shared memory attached from %lx to %lx\n", (unsigned long)shmptr, (unsigned long)shmptr+SHM_SIZE);  
   
  // Remove the shared memory segment  
  if(shmctl(shmid, IPC_RMID, 0) < 0) {  
   printf("shmctl error!\n");  
   exit(4);  
  }  
   
  exit(0);  
 }  

shell:
From the following input, we could see the places of different memory in process of current os.
Based on output:
uninitialized data(array) is in lowest address range, heap memory range is right after, and then stack memory, and finally shared memory is in highest memory range.
 ubuntu@ip-172-31-23-227:~$ ./shm.out  
 array[] from 6010a0 to 60ace0  
 stack around 7fffb5fc6b5c  
 malloced from 19b5010 to 19cd6b0  
 shared memory attached from 7f07cdc82000 to 7f07cdc9a6a0  

2. Example of Memory Mapping:
Besides using the shared memory segment, we can also use the mmap to simplify the procedure. mmap can used to share the memory region among multiple processes if they share the common ancestor and MAP_SHARED flag is specified.

mmap.c:
 #include<stdio.h>  
 #include<stdlib.h>  
 #include<fcntl.h>  
 #include<sys/mman.h>  
   
 #define NLOOPS 1000  
 #define SIZE sizeof(long)  
 int pfd1[2], pfd2[2];  
   
 void TELL_WAIT(void)  
 {  
  // Create two pipes here, one for parent -> child  
  // one for child -> parent  
  if(pipe(pfd1) < 0 || pipe(pfd2) < 0) {  
   printf("pipe error!\n");  
   exit(1);  
  }  
 }  
   
 void TELL_PARENT(pid_t pid)  
 {  
  if(write(pfd2[1], "c", 1) != 1) {  
   printf("write error !\n");  
   exit(2);  
  }  
 }  
   
 void WAIT_PARENT(void)  
 {  
  char c;  
   
  if(read(pfd1[0], &c, 1) != 1) {  
   printf("read error!\n");  
   exit(3);  
  }  
   
  if(c != 'p') {  
   printf("wait parent: incorrect data!\n");  
   exit(4);  
  }  
 }  
   
 void TELL_CHILD(pid_t pid)  
 {  
  if(write(pfd1[1], "p", 1) != 1) {  
   printf("write error!\n");  
   exit(2);  
  }  
 }  
   
 void WAIT_CHILD(void)  
 {  
  char c;  
   
  if(read(pfd2[0], &c, 1) != 1) {  
   printf("read error!\n");  
   exit(3);  
  }  
   
  if(c != 'c') {  
   printf("wait child: incorrect data!\n");  
   exit(4);  
  }  
 }  
   
 int update(long* ptr)  
 {  
  return((*ptr)++); // return the value before increment  
 }  
   
 int main(int argc, char* argv[])  
 {  
  int fd, i, counter;  
  pid_t pid;  
  void *area;  
   
  // Open the file descriptor  
  if((fd = open("/dev/zero", O_RDWR)) < 0) {  
   printf("open error!\n");  
   exit(1);  
  }  
   
  // Map the memory to the file descriptor  
  if((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {  
   printf("mmap error!\n");  
  }  
  close(fd);  
   
  TELL_WAIT();  
  if((pid = fork()) < 0) {  
   printf("fork error!\n");  
   exit(5);  
  }  
  else if(pid > 0) { // parent process  
   for(i = 0; i < NLOOPS; i+=2) {  
    if((counter = update((long*)area))!= i) {  
     printf("parent: expected %d, got %d.\n", i, counter);  
     exit(6);  
    }  
   
    TELL_CHILD(pid);  
    WAIT_CHILD();  
   }  
  } else { // child process  
   for(i = 1; i < NLOOPS + 1; i+=2) {  
    WAIT_PARENT();  
    if((counter = update((long*)area) != i)) {  
     printf("child: expected %d, got %d\n", i, counter);  
     exit(6);  
    }  
   
    TELL_PARENT(getppid());  
   }  
  }  
   
  exit(0);  
 }  

shell:
Nothing is outputted.
So the parent and child process operate the mapped memory region pointed to by "area", and value of memory is incremented as expected while both processes touch it one by one.
Actually area in parent and child are different memory region, but they are mapped by using mmap to map it to the external /dev/zero file, which indicates that the external map could be empty.

Note: if we specify the MAP_PRIVATE when doing the mmap, it would not work. since this flag will make the memory to map to a private copy of file instead of shared one, which make another process can't see the change in memory since they are different and not mapped.

Note:
Modern unix system support anonymous memory mapping, which is almost same as above example, the only difference is:
We don't use /dev/zero to do the mapping, we let OS to select one anonymous region for shared mapping, we can change mmap statement to:

if((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0)) == MAP_FAILED)

We add one more flag: MAP_ANON to indicate the anonymous mapping, and fd as -1, since we don't need to specify the file path.

And the result is, OS will select one anonymous region for mapping.

Saturday, November 15, 2014

Unix Prog: IPC Shared Memory(1)

1. Shared Memory Overview
Shared memory allows two or more processes to share a given region of memory. This is fastest form of IPC, because the data doesn't need to be copied between the client and the server.

But accessing to a given shared region among multiple processes need synchronization. Often, semaphores are used to synchronize shared memory access.

The kernel maintains a structure with at least the following members for each shared memory segment:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/shm.h  
 ......  
 /* Data structure describing a shared memory segment. */  
 struct shmid_ds  
  {  
   struct ipc_perm shm_perm;      /* operation permission struct */  
   size_t shm_segsz;          /* size of segment in bytes */  
   __time_t shm_atime;         /* time of last shmat() */  
 #ifndef __x86_64__  
   unsigned long int __glibc_reserved1;  
 #endif  
   __time_t shm_dtime;         /* time of last shmdt() */  
 #ifndef __x86_64__  
   unsigned long int __glibc_reserved2;  
 #endif  
   __time_t shm_ctime;         /* time of last change by shmctl() */  
 #ifndef __x86_64__  
   unsigned long int __glibc_reserved3;  
 #endif  
   __pid_t shm_cpid;          /* pid of creator */  
   __pid_t shm_lpid;          /* pid of last shmop */  
   shmatt_t shm_nattch;        /* number of current attaches */  
   __syscall_ulong_t __glibc_reserved4;  
   __syscall_ulong_t __glibc_reserved5;  
  };  
 ......  

2. System Calls:
1) shmget
shmget is used to create a new shared memory or obtain an existing shared memory.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/shm.h  
 ......  
 /* Get shared memory segment. */  
 extern int shmget (key_t __key, size_t __size, int __shmflg) __THROW;  
 ......  

When a new segment is created, the following members of the shmid_ds structure are initialized:
ipc_perm structure is initialized. shm_lpid, shm_nattach, shm_atime, and shm_dtime are all set to 0. shm_ctime is set to the current time. shm_segsz is set to the size requested.

The size parameter is the size of the shared memory segment in bytes. Implementation will usually round up the size to a multiple of the system's page size.But if size is not the multiple of page size, the remainder of the last page will be unavailable. If creating the new shared memory segment, we must specify the size. If we are referencing the existing segment, we can specify the size to 0.

2) shmctl
shmctl is the catchall for various shared memory operations.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/shm.h  
 ......  
 /* Shared memory control operation. */  
 extern int shmctl (int __shmid, int __cmd, struct shmid_ds *__buf) __THROW;  
 ......  

Unix supports following commands:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/ipc.h  
 ......  
 /* Control commands for `msgctl', `semctl', and `shmctl'. */  
 #define IPC_RMID    0        /* Remove identifier. */  
 #define IPC_SET     1        /* Set `ipc_perm' options. */  
 #define IPC_STAT    2        /* Get `ipc_perm' options. */  
 ......  
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/shm.h  
 ......  
 /* Commands for `shmctl'. */  
 #define SHM_LOCK    11       /* lock segment (root only) */  
 #define SHM_UNLOCK   12       /* unlock segment (root only) */  
 ......  

IPC_STAT: Fetch the shmid_ds structure for this segment, storing it in the structure pointed to by buf.
IPC_SET: Set the following three fields from the structure pointed to by buf in the shmid_ds structure associated with this shared memory segment: shm_perm.uid, shm_perm.gid, and shm_perm.mode. It can only be executed by a process whose effective user ID equals shm_perm.cuid or shm_perm.uid or by a process with superuser privileges.
IPC_RMID: Remove the shared memory segment set from the system. The segment is not removed until the last process using the segment terminates or detaches it. It can only be executed by a process whose effective user ID equals shm_perm.cuid or shm_perm.uid or by a process with superuser privileges.
SHM_LOCK: Lock the shared memory segment in memory. only can be executed by super user.
SHM_UNLOCK: Unlock the shared memory segment. only can be executed by super user

3) shmat:
shmat is used to help process attach the shared memory segment to its own address space.

 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/shm.h  
 ......  
 /* Attach shared memory segment. */  
 extern void *shmat (int __shmid, const void *__shmaddr, int __shmflg)  
    __THROW;  
 ......  

a) If addr is 0, the segment is attached at the first available address selected by the kernel. This is the recommended technique.
b) If addr is nonzero and SHM_RND is not specified, the segment is attached at the address given by addr.
c) If addr is nonzero and SHM_RND is specified, the segment is attached at the address given by (addr - (addr % SHMLBA)). It round the address down to the next multiple of SHMLBA.

 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/shm.h  
 ......  
 /* Flags for `shmat'. */  
 #define SHM_RDONLY   010000     /* attach read-only else read-write */  
 #define SHM_RND     020000     /* round attach address to SHMLBA */  
 #define SHM_REMAP    040000     /* take-over region on attach */  
 #define SHM_EXEC    0100000     /* execution access */  
 ......  
 /* Segment low boundary address multiple. */  
 #define SHMLBA     (__getpagesize ())  
 extern int __getpagesize (void) __THROW __attribute__ ((__const__));  
 ......  

d) if SHM_RDONLY bit is set, the segment is attached read-only, otherwise, the segment is attached read-write.
e) shmat return the address at which the segment is attached, or -1 if an error occured. If shmat succeeds, the kernel will increment the shm_nattach counter in the shmid_ds structure associated with the shared memory segment.

4) shmdt:
shmdt is used to detach the memory segment from the local address.
Note: this doesn't remove the shared memory segment from the system. It still exists until some process specifically removes it by calling shmctl with a command of IPC_RMID.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/shm.h  
 ......  
 /usr/include/x86_64-linux-gnu/sys/shm.h:extern int shmdt (const void *__shmaddr) __THROW;  
 ......  

The addr argument is the value that was returned by a previous call to shmat. If successful, shmdt will decrement the shm_nattch counter in the associated shmid_ds structure.

Unix Prog: IPC Semaphores

1. Semaphore Overview
A semaphore is a counter used to provide access to a shared data object for multiple processes.

To obtain a shared resource, a process needs to:
1) Test the semaphore that controls the resource
2) If the value of the semaphore is positive, the process can use the resource. In this case, the process decrements the semaphore value by 1.
3) If the value of the semaphore is 0, the process goes to sleep until the semaphore value is greater than 0

To implement semaphores correctly, the test of a semaphore's value and the decrementing of this value must be an atomic operation.. So, semaphores are normally implemented inside the kernel

2. XSI Semaphore

XSI Semaphore is more complicated.
1) A semaphore is not simply a single non-negative value. Instead, we have to define a semaphore as a set of one or more semaphore values.
2) The creation of a semaphore(semget) is independent of its initialization(semctl)
3) All forms of XSI IPC remain in existence even when no process is using them, we have to worry about a program that terminates without releasing the semaphores it has been allocated.

The kernel maintains a semid_ds structure for each semaphore set:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/sem.h  
 ......  
 /* Data structure describing a set of semaphores. */  
 struct semid_ds  
 {  
  struct ipc_perm sem_perm;       /* operation permission struct */  
  __time_t sem_otime;          /* last semop() time */  
  __syscall_ulong_t __glibc_reserved1;  
  __time_t sem_ctime;          /* last time changed by semctl() */  
  __syscall_ulong_t __glibc_reserved2;  
  __syscall_ulong_t sem_nsems;     /* number of semaphores in set */  
  __syscall_ulong_t __glibc_reserved3;  
  __syscall_ulong_t __glibc_reserved4;  
 };  
 ......  

For each semaphore in semaphore set:
semval: current semaphore value, always >= 0
sempid: pid for last operation
semncnt: number of processes awaiting semval > curval
semzcnt: number of processes awaiting semval == 0

3. semaphore system calls
1) semget
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/sem.h  
 ......  
 /* Get semaphore. */  
 extern int semget (key_t __key, int __nsems, int __semflg) __THROW;  
 ......  

semget can be used to create a new semaphore or obtain the semaphore id of existing one.

When a new semaphore is created, following members of semid_ds structure are initialized:
ipc_perm structure is initialized; set_otime is set to 0, sem_ctime is set to the current time, sem_nsems is set to nsems.

nsems represents the number of semaphores in set. If creating new semaphore set, we must specify nsems. If referring to existing semaphore set, nsems could be 0.

2) semctl
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/sem.h  
 ......  
 /* Semaphore control operation. */  
 extern int semctl (int __semid, int __semnum, int __cmd, ...) __THROW;  
 ......  
semctl function is the catch all for various semaphore operations. Fourth argument is optional, depending on the command requested, and if present, is of type semun.

 ubuntu@ip-172-31-23-227:~$ less /usr/include/linux/sem.h  
 ......  
 /* arg for semctl system calls. */  
 union semun {  
     int val;            /* value for SETVAL */  
     struct semid_ds *buf;  /* buffer for IPC_STAT & IPC_SET */  
     unsigned short *array; /* array for GETALL & SETALL */  
     struct seminfo *__buf; /* buffer for IPC_INFO */  
     void *__pad;  
 };  
   
 struct seminfo {  
     int semmap;  
     int semmni;  
     int semmns;  
     int semmnu;  
     int semmsl;  
     int semopm;  
     int semume;  
     int semusz;  
     int semvmx;  
     int semaem;  
 };  
 ......  

All commands supported:
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/ipc.h  
 ......  
 /* Control commands for `msgctl', `semctl', and `shmctl'. */  
 #define IPC_RMID    0        /* Remove identifier. */  
 #define IPC_SET     1        /* Set `ipc_perm' options. */  
 #define IPC_STAT    2        /* Get `ipc_perm' options. */  
 ......  
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/bits/sem.h  
 ......  
 /* Commands for `semctl'. */  
 #define GETPID     11       /* get sempid */  
 #define GETVAL     12       /* get semval */  
 #define GETALL     13       /* get all semval's */  
 #define GETNCNT     14       /* get semncnt */  
 #define GETZCNT     15       /* get semzcnt */  
 #define SETVAL     16       /* set semval */  
 #define SETALL     17       /* set all semval's */  
 ......  

IPC_STAT: Fetch the semid_ds structure for this set, storing it in the structure pointed to by semun->buf
IPC_SET: Set the sem_perm.uid, sem_perm.gid and sem_perm.mode fields from the structure pointed to by semun->buf in the semid_ds structure associated with this set.
IPC_RMID: Remove the semaphore set from the system. This removal is immediate.
GETVAL: Return the value of semval for the member semnum.
SETVAL: Set the value of semval for the member semnum.
GETPID: Return the value of sempid for the member semnum.
GETNCNT: Return the value of semncnt for the member semnum.
GETZCNT: Return the value of semzcnt for the member semnum
GETALL: Fetch all the semaphore values in the set. These values are stored in the array pointed to by semun-> array
SETALL: Set all the semaphore values in the set to the values pointed to by semun->array

3) semop:
semop atomically performs an array of operations on a semaphore set.
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/sem.h  
 ......  
 /* Operate on semaphore. */  
 extern int semop (int __semid, struct sembuf *__sops, size_t __nsops) __THROW;  
 ......  

One semaphore operation can be represented by "struct sembuf"
 ubuntu@ip-172-31-23-227:~$ less /usr/include/x86_64-linux-gnu/sys/sem.h  
 ......  
 /* Structure used for argument to `semop' to describe operations. */  
 struct sembuf  
 {  
  unsigned short int sem_num;  /* semaphore number */  
  short int sem_op;       /* semaphore operation */  
  short int sem_flg;      /* operation flag */  
 };  
 ......  

sem_flg could be:
SEM_UNDO, IPC_NOWAIT

nsops represents the number "struct sembuf"

sem_op can be negative, 0, or positive.
a) If sem_op is positive: The value of sem_op is added to the specified semaphore's value. If undo flag is specified, sem_op is also subtracted from the semaphore's adjustment value for this process.
b) If sem_op is negative: we want to obtain resources that the semaphore controls. If the semaphore's value is greater than or equal to the absolute value of sem_op(the resources are available), the absolute value of sem_op is subtracted from the semaphore's value. If the undo flag is specfied, the absolute value of sem_op is also added to the semaphore's adjustment value for this process.

If the semaphore's value is less than the absolute value of sem_op(resources are not available), the following conditions apply:

If IPC_NOWAIT is specfied, semop returns with an error of EAGAIN

If IPC_NOWAIT is not specified,the semncnt value for this semaphore is incremented(one more process is going to sleep), and the calling process is suspended until:
the semaphore's value becomes greater than or equal to the absolute value of sem_op, or, the semaphore is removed from the system, or, a signal is caught by the process, and the signal handler returns, the value of semncnt for this semaphore is decremented(since the process is waken up), function returns with EINTR.

c) If sem_op is 0, this means that the calling process wants to wait until the semaphore's value becomes 0

If the semaphore's value is currently 0, the function returns immediately

If the semaphore's value is nonzero, the following conditions apply:
If IPC_NOWAIT is specified, return is made with an error of EAGAIN
If IPC_NOWAIT is not specified, the semzcnt value for this semaphore is incremented(one more process waiting for 0 semaphore value is going to sleep), and the calling process is suspended until:

the semaphore's value becomes 0, or the semaphore is removed from the system, or a signal is caught by the process and the signal handler returns.

Note: semop function operates atomically.

4. Semaphore Adjustment on exit:
Whenever we make semaphore option, we can specify SEM_UNDO flag to let semaphore "memorize" the adjustment value. So when the process terminates, the kernel checks whether the process has any adjustment value for the semaphore, if so, applies the adjustment to the corresponding semaphore.

For example, we have one semaphore value 1; then a process get the resource and decrease the semaphore value to 1 (sem_op is -1 in this case) while specifying the SEM_UNDO flag, then kernel memorizes the adjustment value of this process for this semaphore to be "+1". Then when process terminates, the adjustment will be applied to this semaphore making value from 0 to 1 again, allowing other processes to use the resource again.

If user use any SETVAL/SETALL command to manually setup the semaphore value, adjustment of this semaphore for all processes will become 0.

5. Time Comparison between record locking and semaphore

We use semaphore to control whether to allocate and release the resource.
We create the file using first file byte to control  whether to allocate and release the resource(lock the first byte or release the first byte)

Based on testing results:
semaphore is apparently faster than record locking, but if locking a single resource, and don't need all the fancy features of XSI semaphores, record locking is preferred.