References:
http://www.unicode.org/faq/utf_bom.html
http://www.joelonsoftware.com/
Taken from: http://www.joelonsoftware.com/
The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)
By Joel Spolsky
Wednesday, October 08, 2003
In UNICODE, every letter in every alphabet in the world is assigned a "magic
number".
E.g. U+0639. This magic number is called a "code point".
The U stands for Unicode and the number is in hexadecimal.
They can be browsed with the charmap utility in Windows 2000/XP or by visiting
the UNICODE site, www.unicode.org
There is no real limit on the number of letters Unicode can define, and in practice there are far more than 65,536, so it is not possible to represent all of them in two bytes (FF = 255; FFFF = 65,535, i.e. 65,536 values).
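A quick check in Python (not part of the original note) makes the two-byte limit concrete:

```python
# U+0639 (the code point used in the example above)
ch = "\u0639"
print(hex(ord(ch)))  # 0x639

# Code points above U+FFFF cannot fit in two bytes; in UTF-16 they
# need a surrogate pair, i.e. four bytes.
hwair = "\U00010348"  # GOTHIC LETTER HWAIR, U+10348
print(len(hwair.encode("utf-16-le")))  # 4
```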
An encoding is THE WAY UNICODE code points are stored (in memory, in a file,
etc.).
There are many encodings for storing UNICODE code points:
UCS-2 or UTF-16, with a Unicode Byte Order Mark
UTF-8 (has the advantage of being one byte per character and identical to
ASCII up to 127; beyond that it is another story)
UTF-7
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
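The question-mark behaviour described above is easy to reproduce in Python (a sketch, not from the note):

```python
text = "Привет"  # Russian

# Windows-1252 cannot represent Cyrillic: every letter degrades to "?".
print(text.encode("cp1252", errors="replace"))  # b'??????'

# UTF-8 can store any code point and round-trips losslessly.
assert text.encode("utf-8").decode("utf-8") == text
```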
UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".
So how do you find out which encoding a document, an email, a web page, etc.
uses?
There are certain standards:
For an email message, you are expected to have a string in the header of the
form
Content-Type: text/plain; charset="UTF-8"
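For instance, Python's standard email module can pull the charset out of such a header (illustrative sketch):

```python
import email

raw = 'Content-Type: text/plain; charset="UTF-8"\n\nhola'
msg = email.message_from_string(raw)
print(msg.get_content_charset())  # utf-8
```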
For a web page:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
----------------------------------------------
iconv is a GNU/Linux program for converting a file from one encoding to
another.
There is supposedly a more modern one, called convert or something like that;
Ubuntu has it.
It can also be done with VIM: :w ++enc=latin1 (or whichever encoding)
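The conversion iconv performs can be sketched in Python as well (the file names here are made up for the example):

```python
# Equivalent of: iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
with open("in.txt", "wb") as f:           # create a Latin-1 sample file
    f.write("ñandú".encode("latin-1"))

with open("in.txt", "rb") as f:
    text = f.read().decode("latin-1")     # -f ISO-8859-1
with open("out.txt", "wb") as f:
    f.write(text.encode("utf-8"))         # -t UTF-8
```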
----------------------------------------------
TEST PROCEDURE:
Using PSPad and Notepad2 I saved several files, changing the character set
each time:
Latin-1 (ISO-8859-1): PSPad does not have it defined.
OEM is read correctly in DOS by the edit.exe editor, but not by Windows
Notepad, which can read ANSI, Unicode and UTF-8.
ANSI came out identical to ISO-8859-1.
UTF-8 is different from ANSI and takes more bytes, and Unicode is larger still.
Comparison of the file contents in hexadecimal (bh = hexadecimal byte):
Note that OEM and ANSI use one byte per character and UTF-8 uses two for these
accented letters. "Unicode" (UTF-16LE) uses two bytes per character plus a
two-byte BOM (FF FE); the extra <bh:00> bytes in the groups below belong to
the space characters between the letters.
OEM
<bh:a0> <bh:82> <bh:a1> <bh:a2> <bh:a3> <bh:a4>
<bh:a5>
ANSI:
<bh:e1> <bh:e9> <bh:ed> <bh:f3> <bh:fa> <bh:f1>
<bh:d1>
UTF-8:
<bh:c3><bh:a1> <bh:c3><bh:a9> <bh:c3><bh:ad>
<bh:c3><bh:b3> <bh:c3><bh:ba> <bh:c3><bh:b1>
<bh:c3><bh:91>
UNICODE:
<bh:ff><bh:fe><bh:e1><bh:00> <bh:00><bh:e9><bh:00>
<bh:00><bh:ed><bh:00> <bh:00><bh:f3><bh:00>
<bh:00><bh:fa><bh:00> <bh:00><bh:f1><bh:00>
<bh:00><bh:d1><bh:00>
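The dumps above can be reproduced from Python, assuming OEM means code page 437 and ANSI means Latin-1/Windows-1252 (the bytes match exactly):

```python
s = "áéíóúñÑ"  # the letters in the test files, without the spaces

print(s.encode("cp437").hex(" "))    # OEM:    a0 82 a1 a2 a3 a4 a5
print(s.encode("latin-1").hex(" "))  # ANSI:   e1 e9 ed f3 fa f1 d1
print(s.encode("utf-8").hex(" "))    # UTF-8:  c3 a1 c3 a9 ...
print(s.encode("utf-16").hex(" "))   # UTF-16 with BOM: ff fe e1 00 ...
```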
----------------------------------------------------------------------------------------------------------------
Still to find out: what effect does this declaration have in HTML files?
<meta content="text/html; charset=windows-1252" http-equiv="Content-Type">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=SHIFT_JIS">
Internationalization
To allow representation of the world's languages, HTML 4.0 adopts the
Universal Character Set as its character set.
Previous versions of HTML were restricted to ISO-8859-1, a character set that
only handled some western European languages.
The Universal Character Set is character-by-character equivalent to Unicode
2.0 and contains characters for almost all
of the world's languages.
The LANG and DIR attributes are new in HTML 4.0 and apply to almost all
elements.
These attributes allow authors to specify the language and directionality
of text.
The BDO element allows authors to override the bidirectional algorithm used
when right-to-left text such
as Hebrew is presented.
HTML 4.0 also offers new entities for easy entry of mathematical symbols and
Greek letters as well as other
special characters.
Note that using META for this purpose rather than a true HTTP header causes
some browsers to redraw the page
after initially displaying it.
Internationalization Attributes
Taken from HTML 4.0 Common Attributes in the HTML 4.0 help
LANG
A document's primary language may be set using the LANG attribute on the HTML
element, or, alternatively,
by using the Content-Language HTTP header.
The LANG attribute specifies the language of an element's attribute values
and its content, including all
contained elements that do not specify their own LANG attribute.
While the LANG attribute is not widely supported, its use may help search
engines index a document by its
language while allowing speech synthesizers to use language-dependent pronunciation
rules.
As well, visual browsers can use the language's proper quotation marks when
rendering the Q element.
The attribute value is case-insensitive, and should be specified according
to RFC 1766;
examples include en for English, en-US for American English, and ja for Japanese.
Whitespace is not allowed in the language code.
Use of the LANG attribute also allows authors to easily change the style
of text depending on the language.
For example, a bilingual document may have one language in italics if rendered
visually or a different voice
if rendered aurally. The HTML of such a document might be as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
"http://www.w3.org/TR/REC-html40/strict.dtd">
<TITLE>Welcome - Bienvenue</TITLE>
<H1>
<SPAN LANG=en>Welcome</SPAN> -
<SPAN LANG=fr>Bienvenue</SPAN>
</H1>
<P LANG=en>This paragraph is in English.</P>
<P LANG=fr>Ce paragraphe est en français.</P>
...
According to the HTML 4.01 specification, Web browsers are allowed to encode
a form being submitted with a
character encoding different from the one used for the page. See mb_http_input()
to detect character encoding
used by browsers.
Although popular browsers are capable of giving a reasonably accurate guess
to the character encoding of a
given HTML document, it would be better to set the charset parameter in the
Content-Type HTTP header to
the appropriate value by header() or default_charset ini setting.
---------------------------------------------------------------------------------------------
In PHP, the functions for handling this are under:
Multibyte String Functions
---------------------------------------------------------------------------------------------
How does Notepad2 do it, since it is written with the Windows API? I have its
source code!!!
1.3.9 What does 'UnicodeError: ASCII [decoding,encoding] error: ordinal not
in range(128)' mean?
This error indicates that your Python installation can handle only 7-bit ASCII
strings.
There are a couple ways to fix or work around the problem.
If your programs must handle data in arbitrary character set encodings, the
environment the application
runs in will generally identify the encoding of the data it is handing you.
You need to convert the input to Unicode data using that encoding.
For example, a program that handles email or web input will typically find
character set encoding
information in Content-Type headers.
This can then be used to properly convert input data to Unicode.
Assuming the string referred to by value is encoded as UTF-8:
value = unicode(value, "utf-8")
will return a Unicode object. If the data is not correctly encoded as UTF-8,
the above call will raise
a UnicodeError exception.
If you only want strings converted to Unicode which have non-ASCII data, you
can try converting
them first assuming an ASCII encoding, and then generate Unicode objects if
that fails:
try:
    x = unicode(value, "ascii")
except UnicodeError:
    value = unicode(value, "utf-8")
else:
    # value was valid ASCII data
    pass
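The FAQ code above is Python 2; a Python 3 sketch of the same try-ASCII-first idea (the helper name is made up):

```python
def to_text(raw):
    """Decode bytes, trying ASCII first and falling back to UTF-8."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return raw.decode("utf-8")

print(to_text(b"hello"))                  # plain ASCII data
print(to_text("señal".encode("utf-8")))   # non-ASCII falls back to UTF-8
```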
It's possible to set a default encoding in a file called sitecustomize.py
that's part of the Python library.
However, this isn't recommended because changing the Python-wide default encoding
may cause third-party
extension modules to fail.
Note that on Windows, there is an encoding known as "mbcs", which
uses an encoding specific to your
current locale.
In many cases, and particularly when working with COM, this may be an appropriate
default encoding to use.
To write C++ Builder programs that support Unicode there is #include <tchar.h>,
with good support via macros, etc.
#include <stdio.h>
int fgetc(FILE *stream);      /* normal version */
wint_t fgetwc(FILE *stream);  /* Unicode (wide) version, declared in <wchar.h> */
One approach to working with ideographic character sets is to convert all
characters to a wide character encoding scheme such as Unicode. Unicode characters
and strings are also called wide characters and wide character strings. In
the Unicode character set, each character is represented by two bytes. Thus
a Unicode string is a sequence not of individual bytes but of two-byte words.
The first 256 Unicode characters map to the ANSI character set. The Windows
operating system supports Unicode (UCS-2). The Linux operating system supports
UCS-4, a superset of UCS-2.
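That first-256 claim is easy to verify in Python (taking "ANSI" here to mean Latin-1, ISO-8859-1):

```python
# Every byte value 0..255, decoded as Latin-1, yields the Unicode
# code point with the same numeric value.
for i in range(256):
    assert bytes([i]).decode("latin-1") == chr(i)
print("first 256 code points coincide with Latin-1")
```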
--------------------------------------------------------------------------------
Which character set are the HTML files produced by FrontPage and NVU in?
A: ANSI
Which character set does Linux use?
On Ubuntu: UTF-8 is the current locale, alongside ISO-8859-1
On Red Hat 8: ISO-8859-1
--------------------------------------------------------------------------------
Although Unicode (usually the utf8 variant on Unix, and the ucs2 variant on
Windows) is preferable to Latin, it's often not what your operating system
utilities support best. Many Windows users find that a Microsoft character
set, such as cp932 for Japanese Windows, is what's suitable.
-------------------------------------------------------------------------------